Read the training data and the pre-calculated tangent distance matrix:
load("zip.RData")
load("zip.dist.RData")
Perform MDS with tangent distance to reduce the original data to a 2D space:
library(MASS)
dat.mds <- isoMDS(zip.dist, k=2)
initial value 34.218873
final value 34.218662
converged
library(ggplot2)
ggplot(as.data.frame(dat.mds$points), aes(V1, V2, color=dat$train[,1])) +
geom_point() + coord_fixed() + labs(color="digit")
The three digit classes are well separated, better than with PCA (which is equivalent to classical MDS with Euclidean distance).
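For comparison, the PCA embedding can be computed directly with prcomp; this is a sketch assuming, as above, that dat$train holds the digit label in column 1 and the 256 pixel values in the remaining columns:

```r
# PCA embedding of the same data, for comparison with tangent-distance MDS
dat.pca <- prcomp(dat$train[, -1])
ggplot(as.data.frame(dat.pca$x[, 1:2]), aes(PC1, PC2, color=dat$train[,1])) +
geom_point() + coord_fixed() + labs(color="digit")
```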
Subsample 1000 digits for Isomap:
set.seed(10)
ind <- sample(1:nrow(dat$train), 1000)
x <- as.matrix(dat$train[ind, -1])
y <- dat$train[ind, 1]
Perform Isomap with k=5 to find a 2D embedding:
library(RDRToolbox)
dat.isomap <- Isomap(data=x, dims=2, k=5)
Computing distance matrix ... done
Building graph with shortest paths (using 5 nearest neighbours) ... done
Computing low dimensional embedding ... done
number of samples: 1000
reduction from 256 to 2 dimensions
number of connected components in graph: 1
ggplot(as.data.frame(dat.isomap$dim2), aes(V1, V2, color=y)) +
geom_point() + coord_fixed() + labs(color="digit")
Try different parameter values (k) for Isomap:
k.vals <- c(2,3,5,10,25,50)
dat.isomap <- list()
for (i in 1:length(k.vals)) {
dat.isomap[[i]] <- Isomap(data=x, dims=2, k=k.vals[i])$dim2
}
Computing distance matrix ... done
Building graph with shortest paths (using 2 nearest neighbours) ... done
Computing low dimensional embedding ... done
number of samples: 1000
reduction from 256 to 2 dimensions
number of connected components in graph: 2
Computing distance matrix ... done
Building graph with shortest paths (using 3 nearest neighbours) ... done
Computing low dimensional embedding ... done
number of samples: 1000
reduction from 256 to 2 dimensions
number of connected components in graph: 1
Computing distance matrix ... done
Building graph with shortest paths (using 5 nearest neighbours) ... done
Computing low dimensional embedding ... done
number of samples: 1000
reduction from 256 to 2 dimensions
number of connected components in graph: 1
Computing distance matrix ... done
Building graph with shortest paths (using 10 nearest neighbours) ... done
Computing low dimensional embedding ... done
number of samples: 1000
reduction from 256 to 2 dimensions
number of connected components in graph: 1
Computing distance matrix ... done
Building graph with shortest paths (using 25 nearest neighbours) ... done
Computing low dimensional embedding ... done
number of samples: 1000
reduction from 256 to 2 dimensions
number of connected components in graph: 1
Computing distance matrix ... done
Building graph with shortest paths (using 50 nearest neighbours) ... done
Computing low dimensional embedding ... done
number of samples: 1000
reduction from 256 to 2 dimensions
number of connected components in graph: 1
library(gridExtra)
p <- list()
for(i in 1:length(k.vals)) {
p[[i]] <- ggplot(as.data.frame(dat.isomap[[i]]), aes(V1, V2, color=y)) +
geom_point(size=.1) + coord_fixed() + guides(color=guide_legend(title="digit")) +
ggtitle(paste('k =', k.vals[i])) +
theme(legend.position="none", plot.title=element_text(size=10), axis.text.x=element_text(size=8),
axis.text.y=element_text(size=8), axis.title=element_blank())
}
do.call(grid.arrange, c(p, ncol=3))
The results are sensitive to k; note that with k = 2 the neighbourhood graph even splits into two connected components.
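A more quantitative way to compare settings is to inspect the residual variance of the embeddings. The Isomap function in RDRToolbox can plot residual variances over a range of target dimensions via its plotResiduals argument (a sketch; a clear elbow in the plot suggests the intrinsic dimensionality of the data):

```r
# residual variances of embeddings of dimension 1-5, with k fixed at 5
dat.resid <- Isomap(data=x, dims=1:5, k=5, plotResiduals=TRUE)
```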
Apply tSNE with perplexity = 25 to find a 2D embedding of digits data:
library(Rtsne)
set.seed(10)
dat.tsne <- Rtsne(dat$train[,-1], dims=2, perplexity=25, verbose=T)
Read the 2219 x 50 data matrix successfully!
Using no_dims = 2, perplexity = 25.000000, and theta = 0.500000
Computing input similarities...
Normalizing input...
Building tree...
- point 0 of 2219
Done in 0.29 seconds (sparsity = 0.046597)!
Learning embedding...
Iteration 50: error is 79.918394 (50 iterations in 0.88 seconds)
Iteration 100: error is 72.216926 (50 iterations in 0.88 seconds)
Iteration 150: error is 71.396813 (50 iterations in 0.82 seconds)
Iteration 200: error is 71.113253 (50 iterations in 0.81 seconds)
Iteration 250: error is 70.967667 (50 iterations in 0.78 seconds)
Iteration 300: error is 1.925602 (50 iterations in 0.76 seconds)
Iteration 350: error is 1.638494 (50 iterations in 0.72 seconds)
Iteration 400: error is 1.508843 (50 iterations in 0.72 seconds)
Iteration 450: error is 1.436164 (50 iterations in 0.74 seconds)
Iteration 500: error is 1.400877 (50 iterations in 0.78 seconds)
Iteration 550: error is 1.380000 (50 iterations in 0.77 seconds)
Iteration 600: error is 1.365542 (50 iterations in 0.74 seconds)
Iteration 650: error is 1.354502 (50 iterations in 0.75 seconds)
Iteration 700: error is 1.345434 (50 iterations in 0.81 seconds)
Iteration 750: error is 1.338209 (50 iterations in 0.78 seconds)
Iteration 800: error is 1.332063 (50 iterations in 0.74 seconds)
Iteration 850: error is 1.326741 (50 iterations in 0.76 seconds)
Iteration 900: error is 1.322192 (50 iterations in 0.81 seconds)
Iteration 950: error is 1.318307 (50 iterations in 0.80 seconds)
Iteration 1000: error is 1.314517 (50 iterations in 0.76 seconds)
Fitting performed in 15.61 seconds.
ggplot(as.data.frame(dat.tsne$Y), aes(V1, V2, color=dat$train[,1])) +
geom_point() + coord_fixed() + labs(color="digit")
Try different perplexity values:
prep.vals <- c(2,5,10,25,50,100)
calc.tsne <- function(seed) {
dat.tsne <- list()
set.seed(seed)
for (i in 1:length(prep.vals)) {
dat.tsne[[i]] <- Rtsne(dat$train[,-1], dims=2, perplexity=prep.vals[i])$Y
}
dat.tsne
}
dat.tsne <- calc.tsne(seed=10)
plot.tsne <- function() {
p <- list()
for(i in 1:length(prep.vals)) {
p[[i]] <- ggplot(as.data.frame(dat.tsne[[i]]), aes(V1, V2, color=dat$train[,1])) +
geom_point(size=.1) + coord_fixed() + guides(color=guide_legend(title="digit")) +
ggtitle(paste('perplexity =', prep.vals[i])) +
theme(legend.position="none", plot.title=element_text(size=10), axis.text.x=element_text(size=8),
axis.text.y=element_text(size=8), axis.title=element_blank())
}
do.call(grid.arrange, c(p, ncol=3))
}
plot.tsne()
The results are sensitive to perplexity.
With different seeds the results differ, since t-SNE optimizes a non-convex objective and is not guaranteed to find the global optimum.
dat.tsne <- calc.tsne(seed=100)
plot.tsne()
dat.tsne <- calc.tsne(seed=1000)
plot.tsne()
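The seed dependence can be removed by fixing the starting configuration: Rtsne accepts a user-supplied initialization through its Y_init argument, so initializing from the first two principal components makes repeated runs reproducible without set.seed (a sketch; the optimization itself is deterministic given the initialization):

```r
# deterministic t-SNE: initialize the embedding from the first two PCs
pca.init <- prcomp(dat$train[, -1])$x[, 1:2]
dat.tsne.det <- Rtsne(dat$train[,-1], dims=2, perplexity=25,
Y_init=pca.init, pca=FALSE)
```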
Apply KNN with one nearest neighbor to the test data:
library(class)
y.test.knn <- knn(dat$train[,-1], dat$test[,-1], dat$train[,1], k=1, prob=F)
Confusion matrix for the test data:
table(y.test.knn, dat$test[,1])
y.test.knn 1 3 5
1 261 0 1
3 2 157 5
5 1 9 154
Misclassification rate for the test data:
mean(y.test.knn != dat$test[,1])
[1] 0.03050847
Apply KNN with different k values (only odd values are included, since even values of k produce many more ties):
set.seed(10)
k.vals <- seq(1, 21, by=2)
misrates <- sapply(k.vals, function(k) mean(knn(dat$train[,-1],dat$test[,-1],dat$train[,1],k=k,prob=F) != dat$test[,1]))
Plot the error rate against k:
qplot(k.vals, misrates, geom="line") + labs(x="k", y="error rate")
k = 1, 5, and 7 all yield the same, best error rate.
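Since the test set is used above to pick k, a cleaner alternative is to select k by leave-one-out cross-validation on the training data with knn.cv from the class package, and reserve the test set for the final error estimate (a sketch):

```r
# leave-one-out CV error on the training set for each odd k
cv.rates <- sapply(k.vals, function(k)
mean(knn.cv(dat$train[,-1], dat$train[,1], k=k) != dat$train[,1]))
k.vals[which.min(cv.rates)]
```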
Function for plotting the digit data as an image:
conv.image <- function(vec)
{
# reshape the 256-vector into a 16 x 16 matrix, flip it vertically,
# and negate so the ink appears dark on a light background
mat <- matrix(as.numeric(vec), nrow=16, ncol=16)
mat <- -mat[, 16 : 1]
par(mar=c(0, 0, 0, 0))
image(mat, col=gray(seq(0, 1, 0.01)), xaxt='n', yaxt='n')
}
Here are the digits that are actually 3’s but are misclassified as 5’s (k=1). Only the first one is plotted here; the rest are saved into pdf files.
mis.three <- intersect(which(y.test.knn==5), which(dat$test[,1]==3))
conv.image(dat$test[mis.three[1], -1])
for (i in 1:length(mis.three)) {
pdf(paste0('mis.three.knn.', i, '.pdf'))
conv.image(dat$test[mis.three[i], -1])
dev.off()
}
Digits that are actually 5’s but are misclassified as 3’s:
mis.five <- intersect(which(y.test.knn==3), which(dat$test[,1]==5))
conv.image(dat$test[mis.five[1], -1])
for (i in 1:length(mis.five)) {
pdf(paste0('mis.five.knn.', i, '.pdf'))
conv.image(dat$test[mis.five[i], -1])
dev.off()
}
The QDA classifier fails with an error: with 256 highly correlated pixel predictors, the per-class covariance matrices are rank-deficient.
dat.qda <- qda(V1 ~ ., data=dat$train)
Error in qda.default(x, grouping, ...) : rank deficiency in group 1
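A common workaround is to reduce the dimension before QDA, for example by fitting QDA on the leading principal component scores so that each class covariance matrix is full-rank (a sketch; the number of components, 40 here, is an arbitrary illustrative choice):

```r
# project onto the top 40 PCs, then fit QDA on the scores
dat.pca <- prcomp(dat$train[, -1])
train.scores <- data.frame(y=dat$train[,1], dat.pca$x[, 1:40])
dat.qda.pc <- qda(y ~ ., data=train.scores)
test.scores <- data.frame(predict(dat.pca, dat$test[, -1])[, 1:40])
y.test.qda <- predict(dat.qda.pc, test.scores)$class
mean(y.test.qda != dat$test[,1])
```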
Apply the LDA classifier and test it on the test data:
dat.lda <- lda(V1 ~ ., data=dat$train)
y.test.lda <- predict(dat.lda, dat$test)$class
Confusion matrix for the test data:
table(y.test.lda, dat$test[,1])
y.test.lda 1 3 5
1 259 0 1
3 4 151 11
5 1 15 148
Misclassification rate for the test data:
mean(y.test.lda != dat$test[,1])
[1] 0.05423729
Digits that are actually 3’s but are misclassified as 5’s:
mis.three <- intersect(which(y.test.lda==5), which(dat$test[,1]==3))
conv.image(dat$test[mis.three[1], -1])
for (i in 1:length(mis.three)) {
pdf(paste0('mis.three.lda.', i, '.pdf'))
conv.image(dat$test[mis.three[i], -1])
dev.off()
}
Digits that are actually 5’s but are misclassified as 3’s:
mis.five <- intersect(which(y.test.lda==3), which(dat$test[,1]==5))
conv.image(dat$test[mis.five[1], -1])
for (i in 1:length(mis.five)) {
pdf(paste0('mis.five.lda.', i, '.pdf'))
conv.image(dat$test[mis.five[i], -1])
dev.off()
}
Apply High-Dimensional Regularized Discriminant Analysis (HDRDA), using the train function in the caret package to tune the hyperparameters. The method has 3 tunable hyperparameters: lambda (the pooling parameter, which shifts each class covariance matrix between the pooled estimate and the class-specific estimate), gamma (the shrinkage parameter, which shifts the covariance matrix towards or away from a diagonal matrix), and shrinkage_type (the type of covariance-matrix shrinkage, ridge or convex).
library(caret)
Loading required package: lattice
set.seed(1234)
train.control <- trainControl(method="repeatedcv", number=5, repeats=5, search="random")
dat.hdrda <- train(V1 ~ ., data=dat$train, method="hdrda", trControl=train.control)
y.test.hdrda <- predict(dat.hdrda, dat$test)
The results of the tuning process:
dat.hdrda
High-Dimensional Regularized Discriminant Analysis
2219 samples
256 predictor
3 classes: '1', '3', '5'
No pre-processing
Resampling: Cross-Validated (5 fold, repeated 5 times)
Summary of sample sizes: 1775, 1776, 1775, 1775, 1775, 1774, ...
Resampling results across tuning parameters:
gamma lambda shrinkage_type Accuracy Kappa
0.6092747 0.640310605 convex 0.9854015 0.9773447
0.6222994 0.860915384 ridge 0.9845002 0.9759442
0.6233794 0.009495756 convex 0.9912586 0.9864361
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were gamma = 0.6233794, lambda = 0.009495756 and shrinkage_type = convex.
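To refit (or later reproduce) the selected model without repeating the random search, the chosen values can be passed to train as a one-row tuneGrid, with resampling switched off (a sketch; the values are those selected above):

```r
# refit HDRDA once, at the hyperparameters selected by the random search
best.grid <- data.frame(gamma=0.6233794, lambda=0.009495756,
shrinkage_type="convex")
dat.hdrda.final <- train(V1 ~ ., data=dat$train, method="hdrda",
trControl=trainControl(method="none"), tuneGrid=best.grid)
```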
Confusion matrix for the test data:
table(y.test.hdrda, dat$test[,1])
y.test.hdrda 1 3 5
1 261 0 0
3 2 154 2
5 1 12 158
Misclassification rate for the test data:
mean(y.test.hdrda != dat$test[,1])
[1] 0.02881356